Abstract

Temporal information can provide useful features for recognizing facial expressions. However, manually designing such features requires a lot of effort. In this paper, to reduce this effort, we adopt deep learning, which is regarded as a tool for automatically extracting useful features from raw data. Our deep network is based on two different models. The first deep network extracts temporal geometry features from temporal facial landmark points, while the other deep network extracts temporal appearance features from image sequences. These two models are combined in order to boost the performance of facial expression recognition. Through several experiments, we show that the two models complement each other. As a result, we achieve superior performance to other state-of-the-art methods on the CK+ and Oulu-CASIA databases. Furthermore, one of the main contributions of this paper is that our deep network catches the facial action points automatically.

1. Introduction 

Recognizing an emotion from a facial image is a classic problem in computer vision, and many studies have been conducted. Recently, facial expression recognition research has sought to increase recognition performance by extracting useful temporal features from image sequences [15, 19, 12]. In general, such spatio-temporal feature extractors are manually designed, which is a difficult task. For example, Figure 1 shows the facial landmark points of image sequences for four emotions: anger, happiness, surprise, and fear. In order to classify the four emotions using the landmark points, one may use the change over time of the distance between the blue and red points. However, this representation can sometimes be ambiguous. It is only a guess, so we cannot be sure that the information in the representation is really useful.

Well-known deep learning algorithms, such as the deep neural network (DNN) and the convolutional neural network (CNN), can automatically extract useful representations from raw data (e.g., image data). However, there is a limit to applying them directly to facial expression recognition databases such as CK+ [13], MMI [17], and Oulu-CASIA [21]. The major reason is that the amount of data is very small, so a deep network with many parameters can easily overfit during training, which directly decreases accuracy. Furthermore, if the training data is high dimensional, the overfitting problem becomes more severe.

Figure 1. The Sequences of Facial Landmark Points. Each row represents the sequence of facial landmark points for one emotion (top to bottom: anger, happiness, surprise, and fear). The change of the vector from the red point to the blue point over time can aid in facial expression recognition, but no one knows whether the representation is really helpful. In this paper, we use deep learning techniques to extract useful geometric representations rather than a heuristic feature.

In this paper, we are interested in recognizing facial expressions using image sequence data with a deep network. In order to overcome the problem of a small amount of data, we construct two small deep networks that complement each other. One of the deep networks is trained using image sequences, while the other deep network learns the temporal trajectories of facial landmark points. In other words, the first network focuses more on appearance changes of facial expressions over time, while the second network is directly related to the motion of facial organs. We utilize a CNN for training on image sequences and a DNN for training on temporal facial landmark points. Through this process, our two deep networks learn the emotional facial actions in different ways. Our main contributions in this paper are as follows:

• Two deep network models are presented in order to extract useful temporal representations from two kinds of sequential data: image sequences and the trajectories of landmark points.

• We observed that the two networks automatically detect moving facial parts and action points, respectively.

• By integrating these two networks with different characteristics, performance improvement is achieved in terms of the recognition rates on both the CK+ and Oulu-CASIA databases.

Another advantage of our method compared to other state-of-the-art algorithms is that it is easy to implement using any deep learning library, such as CudaConvnet2, Theano, and Caffe [10, 3, £]. Also, the code of the preprocessing algorithms used in our method is publicly available and can be easily obtained from the respective websites [20, 2, 24]. Furthermore, our deep network does not have many parameters, and we utilize the rectified linear unit (ReLU) [4], so prediction is fast.

2. Related Work 

2.1. Deep Learning-based Method 

Typically, a CNN takes a single image as input, but CNNs can also be used for temporal recognition problems such as action recognition [15]. In this 3D CNN method, the filters are shared along the time axis. This method has also been applied to facial expression recognition with deformable action part constraints, which is called 3D CNN-DAP [11]. The 3D CNN-DAP method is based on 3D CNN and uses strong spatial structural constraints on the dynamic action parts. It gains a performance boost from this hybrid approach, but it still falls short of the performance of other state-of-the-art methods.

2.2. Hand-crafted Feature-based Method 

Many studies in this field have been conducted. Traditional local features such as HOG, SIFT, and LBP have been extended to video; the extensions are called 3D HOG [9], 3D SIFT [16], and LBP-TOP [22], respectively. There was also an attempt to improve accuracy through temporal modeling of each facial shape (TMS) [7], which used conditional random fields and hand-crafted shape-appearance features.

Recently, spatio-temporal covariance descriptors with the Riemannian locality preserving projection approach were developed (Cov3D) [15], and an interval temporal Bayesian network (ITBN) for capturing complex spatio-temporal relations among muscles was proposed [19]. Most recently, an expressionlet-based spatio-temporal manifold representation was developed (STM-ExpLet) [12]. This technique achieves the best performance among existing methods on the CK+ and MMI databases.

3. Motivation 

Human facial movement over time is very closely associated with emotions. For example, both ends of the lips go up when people make a happy expression.

Such movements of the face are described by the facial action coding system (FACS) [1]. In particular, the emotional facial action coding system (EFACS) classifies the movement of the face according to facial expression [1]. The majority of facial expression recognition databases follow this system.

Our approach is inspired by the EFACS. We are interested in finding meaningful temporal facial action features to improve the performance of facial expression recognition without using any prior information about EFACS. To achieve this, we adopt a deep learning technique to extract useful features automatically. However, it was difficult to capture the facial action information from the face using only a CNN. Consequently, we also used temporal facial landmark points, which represent the movement of specific parts of the face.

4. Our Approach 

We utilize deep learning techniques in order to recognize facial expressions. Two deep networks are combined: the deep temporal appearance network (DTAN) and the deep temporal geometry network (DTGN). The DTAN, which is based on CNN, extracts the temporal appearance features necessary for facial expression recognition. The DTGN, which is based on DNN, captures geometric motion information from the facial landmark points. Finally, these two models are integrated in order to increase the expression recognition performance. The integrated network is called the deep temporal appearance-geometry network (DTAGN). The flowchart for our overall process is shown in Figure 2.

4.1. Preprocessing 

In general, the length of image sequences is variable, but the input dimension of a deep network is usually fixed. Consequently, normalization along the time axis is required before a sequence can be used as input for the networks. We adopted the method in [24], which converts an image sequence into a fixed-length sequence.

Figure 2. Overall Structure of Our Approach. The green box with a dotted line represents the preprocessing step of our method. The blue and red boxes with dotted lines correspond to the two architectures of the deep networks. Our two deep networks receive an image sequence and facial landmark points as input, respectively. Conv and FC refer to the convolutional and fully connected layers, respectively. Finally, the outputs of these networks are integrated using weighted summation, which is represented in the purple box.

Then, the faces in the input image sequences are detected, cropped, and rescaled to 64 × 64. From these detected faces, facial landmark points are extracted using the IntraFace algorithm [20]. This algorithm provides 49 accurate facial landmark points covering the two eyes, the nose, the mouth, and the two eyebrows.

4.2. Deep Temporal Appearance Network 

In this paper, a CNN is used for capturing temporal changes of appearance. A conventional CNN uses still images as input, and 3D CNN was presented recently for dealing with image sequences. As mentioned in Section 2, the 3D CNN method shares the 3D filters along the time axis [15]. However, we use n-image sequences without weight sharing along the time axis. This means that each filter plays a different role depending on the time. The activation value of the first layer is defined as follows:

$$f_{x,y,i} = \sigma\left(\sum_{t=1}^{T_a} \sum_{r=0}^{R} \sum_{s=0}^{S} I^{(t)}_{x+r,\,y+s} \cdot w^{(t)}_{r,s,i} + b_i\right), \quad (1)$$

where $f_{x,y,i}$ is the activation value at position $(x, y)$ of the $i$-th feature map. $R$ and $S$ are the number of rows and columns of the filter, respectively. $T_a$ is the total number of frames in the input grayscale image sequence. $I^{(t)}_{x+r,\,y+s}$ is the value at position $(x + r, y + s)$ of the input frame at time $t$. $w^{(t)}_{r,s,i}$ is the value of the $i$-th filter at $(r, s)$ for the $t$-th frame, and $b_i$ is the bias for the $i$-th filter. $\sigma(\cdot)$ is an activation function, which is usually non-linear. We utilize the ReLU, $\sigma(x) = \max(0, x)$, as the activation function, where $x$ is an input value [4].
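As a rough illustration of the first-layer activation, the following pure-Python sketch evaluates it with one filter per input frame, so that no weights are shared along the time axis. The function names, the nested-list data layout, and the choice to evaluate only positions where the filter fits entirely inside the frame (no padding) are assumptions made for this example and are not details of our implementation. The loops run over r = 0, ..., R−1 and s = 0, ..., S−1 so that each filter has exactly R rows and S columns.

```python
def relu(value):
    """ReLU activation: max(0, value)."""
    return max(0.0, value)


def first_layer_activation(frames, filters, biases):
    """Compute first-layer feature maps without temporal weight sharing.

    frames:  list of T_a grayscale frames, each a list of rows (H x W).
    filters: filters[i][t] is the R x S filter of feature map i for frame t.
    biases:  biases[i] is the bias of feature map i.
    Returns one (H-R+1) x (W-S+1) map per filter set (assumed: no padding).
    """
    num_frames = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    rows, cols = len(filters[0][0]), len(filters[0][0][0])
    maps = []
    for w_i, b_i in zip(filters, biases):
        feature_map = []
        for x in range(height - rows + 1):
            out_row = []
            for y in range(width - cols + 1):
                total = b_i
                # Sum over frames t and filter offsets (r, s).
                for t in range(num_frames):
                    for r in range(rows):
                        for s in range(cols):
                            total += frames[t][x + r][y + s] * w_i[t][r][s]
                out_row.append(relu(total))
            feature_map.append(out_row)
        maps.append(feature_map)
    return maps


# Toy example (hypothetical data): two 3x3 frames differing only at the center.
still = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
moved = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
# One feature map whose 2x2 filters subtract the first frame from the second.
difference_filter = [[[-1, -1], [-1, -1]], [[1, 1], [1, 1]]]
changing = first_layer_activation([still, moved], [difference_filter], [0.0])
static = first_layer_activation([moved, moved], [difference_filter], [0.0])
```

In this toy case the filter pair responds only when the two frames differ: `changing` is positive everywhere and `static` is zero everywhere, which is the kind of frame-difference behavior that per-frame filters make possible.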

The other layers do not differ from those of a conventional CNN: the output of the convolutional layer is rescaled to half size in a pooling layer for efficient calculation. Using these activation values, convolution and pooling are performed again. Finally, the output values are passed through two fully connected layers and then classified using softmax.

For training our network, stochastic gradient descent is used for optimization, and dropout [6] and weight decay are utilized for regularization.

Our network is not very deep and does not have many parameters, in order to avoid overfitting, since facial expression recognition databases are very small: for example, there are only 205 sequences in the MMI database. Also, the first layer turns out to detect temporal differences in appearance across the input image sequence, as discussed in Section 5.

4.3. Deep Temporal Geometry Network 

DTGN receives the trajectories of facial landmark points as input. These trajectories can be considered as one-dimensional signals and are defined as follows:

$$X^{(t)} = \left[x^{(t)}_1, y^{(t)}_1, x^{(t)}_2, y^{(t)}_2, \ldots, x^{(t)}_n, y^{(t)}_n\right]^{\top}, \quad (2)$$

where $n$ is the total number of landmark points at frame $t$, and $X^{(t)}$ is a $2n$ dimensional vector at $t$. $x^{(t)}_k$ and $y^{(t)}_k$ are the coordinates of the $k$-th facial landmark point at frame $t$.

These xy-coordinates are inappropriate for direct use as input to the deep network, because they are not normalized. To normalize the xy-coordinates, we first subtract the nose position of the face from each point (the position of the red point among the facial landmark points in the red box with the dotted line in Figure 2). Then, each coordinate is divided by the standard deviation of the corresponding coordinates in each frame as follows:

$$\bar{x}^{(t)}_k = \frac{x^{(t)}_k - x^{(t)}_o}{\sigma^{(t)}_x}, \quad (3)$$

where $x^{(t)}_k$ is the x-coordinate of the $k$-th facial landmark point at frame $t$, $x^{(t)}_o$ is the x-coordinate of the nose landmark at frame $t$, and $\sigma^{(t)}_x$ is the standard deviation of the x-coordinates at frame $t$. This process is also applied to $y^{(t)}_k$. Finally, these normalized points are concatenated along the time axis and used as the input to the DTGN:

$$X = \left[\bar{x}^{(1)}_1, \bar{y}^{(1)}_1, \ldots, \bar{x}^{(T_g)}_n, \bar{y}^{(T_g)}_n\right]^{\top}, \quad (4)$$

where $X$ is a $2nT_g$ dimensional input vector, and $\bar{x}^{(T_g)}_k$ and $\bar{y}^{(T_g)}_k$ are the coordinates of the $k$-th normalized landmark point at frame $T_g$.
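The normalization and concatenation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the `nose_index` parameter are hypothetical, landmarks are stored as (x, y) tuples, and the population standard deviation (`statistics.pstdev`) is used because the text does not specify which estimator was applied.

```python
from statistics import pstdev


def normalize_frame(points, nose_index):
    """Center one frame of (x, y) landmarks on the nose and scale each axis.

    Assumption: the population standard deviation of the frame is used.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    nose_x, nose_y = points[nose_index]
    sigma_x, sigma_y = pstdev(xs), pstdev(ys)
    return [((x - nose_x) / sigma_x, (y - nose_y) / sigma_y) for x, y in points]


def build_dtgn_input(sequence, nose_index):
    """Concatenate the normalized landmarks of all frames into one flat vector.

    Layout: frame by frame, and within a frame x_1, y_1, ..., x_n, y_n.
    """
    vector = []
    for points in sequence:
        for x, y in normalize_frame(points, nose_index):
            vector.extend((x, y))
    return vector


# Hypothetical data: 12 frames with 49 synthetic landmarks each.
toy_frame = [(float(k), float((k * k) % 7)) for k in range(49)]
toy_sequence = [toy_frame for _ in range(12)]
toy_input = build_dtgn_input(toy_sequence, nose_index=10)
```

With 49 landmarks and 12 frames the resulting vector has 2 × 49 × 12 = 1176 entries, the nose landmark maps to the origin in every frame, and each coordinate axis has unit standard deviation within a frame.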

The figure in the red box with a dotted line in Figure 2 illustrates the architecture of our DTGN model. Our network receives the concatenated landmark points X as input. We utilize two hidden layers, and the top layer is a softmax layer. Similar to the DTAN, this network is trained using stochastic gradient descent. The activation function for each hidden layer is ReLU. Furthermore, dropout [6] and weight decay are used to regularize the network.

4.4. Data Augmentation 

In order to classify unobserved data well, a large amount of training data is required. However, facial expression databases such as CK+, Oulu-CASIA, and MMI provide only hundreds of sequences. This makes a deep network prone to overfitting, because a deep network has many parameters. To prevent this problem, various data augmentation techniques are required.

First, whole image sequences are horizontally flipped. Then, each image is rotated by each angle in {−15°, −10°, −5°, 5°, 10°, 15°}. This makes the model robust against slight rotation changes of the input images. In total, we obtain fourteen times more data: the original images (1), the flipped images (1), and the rotated images for each angle together with their flipped versions (12).
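The fourteen-fold image augmentation can be sketched as below. This is only an illustration: the function names are hypothetical, images are plain nested lists, and rotation uses nearest-neighbor sampling about the image center with zero fill, whereas the interpolation and border handling actually used are not specified in the text.

```python
import math

ANGLES_DEG = (-15, -10, -5, 5, 10, 15)


def flip_horizontal(image):
    """Mirror an image (list of rows) left to right."""
    return [row[::-1] for row in image]


def rotate(image, angle_deg, fill=0):
    """Rotate an image about its center (assumed: nearest-neighbor sampling)."""
    height, width = len(image), len(image[0])
    center_y, center_x = (height - 1) / 2.0, (width - 1) / 2.0
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rotated = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Inverse mapping: locate the source pixel that lands on (x, y).
            dx, dy = x - center_x, y - center_y
            src_x = round(cos_t * dx + sin_t * dy + center_x)
            src_y = round(-sin_t * dx + cos_t * dy + center_y)
            if 0 <= src_x < width and 0 <= src_y < height:
                rotated[y][x] = image[src_y][src_x]
    return rotated


def augment_sequence(frames):
    """Return 14 variants: original, flipped, and both rotated by six angles."""
    flipped = [flip_horizontal(frame) for frame in frames]
    variants = [frames, flipped]
    for angle in ANGLES_DEG:
        for base in (frames, flipped):
            variants.append([rotate(frame, angle) for frame in base])
    return variants


# Hypothetical 2x2 "images" forming a three-frame sequence.
toy_frames = [[[1, 2], [3, 4]] for _ in range(3)]
toy_variants = augment_sequence(toy_frames)
```

Every frame of a sequence receives the same transformation, so the temporal relationship between frames is preserved in each of the 14 variants.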

Similar to the augmentation of image sequences, the normalized facial landmark points are also horizontally flipped. Then, Gaussian noise is added to the raw landmark points:

$$\tilde{x}^{(t)}_i = x^{(t)}_i + z^{(t)}_i, \quad (5)$$

where $z^{(t)}_i \sim \mathcal{N}(0, \sigma_i^2)$ is additive noise with noise level $\sigma_i$ for the x-coordinate of the $i$-th landmark point at frame $t$. We set the value of $\sigma_i$ to 0.01. We also contaminated the y-coordinate with noise in the same way. This makes the model robust to slight pose changes.

To prepare for rotation changes, we construct rotated data as follows:

$$\left[\tilde{x}^{(t)}_i, \tilde{y}^{(t)}_i\right]^{\top} = R^{(t)} \left[x^{(t)}_i, y^{(t)}_i\right]^{\top}, \quad (6)$$

for $i = 1, \ldots, n$, where $\tilde{x}^{(t)}_i$ and $\tilde{y}^{(t)}_i$ are the $i$-th rotated xy-coordinates at time $t$, and $R^{(t)}$ is a 2 × 2 rotation matrix for the xy-coordinates at time $t$, which has an angle $\theta^{(t)}$. The value of $\theta^{(t)}$ is drawn from a uniform distribution, $\theta^{(t)} \sim \mathrm{Unif}[\beta, \gamma]$. We set the values of $\beta$ and $\gamma$ to $-\pi/10$ and $\pi/10$, respectively.

We performed the first data augmentation method in equation 5 three times, and the second data augmentation method in equation 6 was also conducted three times. Consequently, we obtained six times more facial landmark points. As a result, we augmented the training data fourteen times: original coordinates (1), flipped coordinates (1), and two augmentation methods for each (12).

Figure 3. Filters Learned by Single Frame-based CNN. The input image size was 64 × 64, and the filter size was 5 × 5. 18 filters were selected for visualization from the 64 learned filters in the first convolutional layer. The black and white colors represent the negative and positive values, respectively. There were several directional edge and blob detection filters.

Figure 4. Feature Maps Corresponding to Figure 3. The left image represents the input image, and the right image shows the feature maps extracted by each filter in Figure 3. The emotion label for the input image was surprise. The blue and red values represent the low and high response values, respectively. The edges of the input image are detected by most of the filters.

4.5. Model Integration

The outputs from the top layers of the two networks were integrated using equation 7.


$$o_i = \alpha p_i + (1 - \alpha) q_i, \quad 0 \le \alpha \le 1, \quad (7)$$

for $i = 1, \ldots, c$, where $c$ is the total number of emotion classes, $p_i$ and $q_i$ are the outputs of DTAN and DTGN, respectively, and $o_i$ is the final score. The index with the maximum score is the final prediction. The parameter $\alpha$ depends on the performance of each network. If the performances of the two networks are similar, the value of $\alpha$ is 0.5 (e.g., the experiments on CK+ and Oulu-CASIA).
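The weighted summation in equation 7 and the final arg-max decision can be sketched as follows. The function name and the example score vectors are hypothetical and serve only to illustrate the rule.

```python
def fuse_scores(p, q, alpha=0.5):
    """Combine DTAN scores p and DTGN scores q by weighted summation.

    Returns the fused scores and the index of the predicted class.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(p) != len(q):
        raise ValueError("both networks must output the same number of classes")
    fused = [alpha * p_i + (1.0 - alpha) * q_i for p_i, q_i in zip(p, q)]
    prediction = max(range(len(fused)), key=fused.__getitem__)
    return fused, prediction


# Hypothetical softmax outputs for three classes.
appearance_scores = [0.6, 0.3, 0.1]
geometry_scores = [0.2, 0.7, 0.1]
fused, predicted = fuse_scores(appearance_scores, geometry_scores, alpha=0.5)
```

In this example the appearance network alone would choose class 0, but the geometry network is more confident about class 1, so the fused decision with α = 0.5 is class 1. Setting α to 1 or 0 recovers the decision of DTAN or DTGN alone.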
Figure 5. Filters Learned by DTAN. The number of input frames was three in this figure, so there are three filters corresponding to each frame. The three filters in each bold black box generate one feature map. As with Figure 3, 18 filters were selected from the 64 learned filters. From this figure, we conjecture that our network detects differences between frames.

Figure 6. Feature Maps Corresponding to Figure 5. The gray images on the left side form the image sequence used as input, and the images on the right side are the feature maps corresponding to each filter in Figure 5. Blue and red represent the low and high response values. The emotion label for the input image sequence was surprise. We observed that our network responded to the parts that move when expressing emotion.

Figure 7. Visualization of Representation Extracted by DTGN. (a) The top-10 important positions detected by our network (red points in the left figure). In order to visualize these ten points, we calculated the average of the absolute values of the weights in the first FC layer connected to each landmark point. Then, these values were sorted in descending order, and the 10 points with the highest values were selected. The action parts defined by EFACS [1] are shown in the right figure (green colored area). (b) Visualization of the original input data in the CK+ database, using t-SNE [18]. The number of data points was 4149: (327 − 33) × 14 augmented training data and 33 test data. The small dots and large squares represent training and test data, respectively. The numbers in the legend correspond to the labels of the CK+ database: 0-anger, 1-contempt, 2-disgust, 3-fear, 4-happiness, 5-sadness, and 6-surprise. (c) Visualization of the outputs of the second hidden layer. The data points were automatically grouped by DTGN.


5. What Will Deep Networks Learn? 

5.1. Visualization of DTAN 

To find out what our DTAN learned, the learned filters were visualized. Figure 3 shows the filters learned in the first convolutional layer by a single frame-based CNN using the CK+ database. The filters were similar to edge or blob detectors. The responses corresponding to each filter are provided in Figure 4. Each filter detected edge components in a particular direction.

The filters learned by our DTAN, which is a multiple frame-based CNN, are shown in Figure 5. Unlike the filters of the single frame-based CNN, these filters were not edge or blob detectors. To exaggerate a little, they were just combinations of black, gray, and white filters. Figure 6 shows the meaning of these filters. High response values usually appeared in parts with big differences between input frames. In other words, the first convolutional layer of our DTAN appears to detect facial movements arising from the expression of emotion.

5.2. Visualization of DTGN 

The left side of Figure 7 (a) shows the significant facial landmark points for facial expression recognition. These positions were automatically found by DTGN. The extracted positions were very similar to those of EFACS on the right side of Figure 7 (a). To explain further, the two extracted points on the nose move apart when people make a happy expression because both cheeks are pulled up. Also, our network did not select the eyelids, because the database includes eye blinking actions.

In order to examine the characteristics of the features extracted from the top hidden layer, we also visualized the augmented input feature vectors using t-SNE, which is a useful tool for visualizing high dimensional data [18]. The input data were spread randomly in Figure 7 (b), but the features extracted from the second hidden layer were separated according to their labels, as shown in Figure 7 (c).
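The landmark ranking used for Figure 7 (a), namely averaging the absolute first-layer weights connected to each landmark point and keeping the largest values, can be sketched as follows. The function name is hypothetical, and the sketch assumes that the input vector is ordered frame by frame with interleaved x and y coordinates and that `weights[j][d]` connects input dimension `d` to hidden unit `j`.

```python
def rank_landmarks(weights, num_landmarks, num_frames, top_k=10):
    """Rank landmarks by the mean absolute first-layer weight attached to them.

    weights[j][d]: weight from input dimension d to hidden unit j.
    Assumed input layout: dimension d = t * 2n + 2k + c for frame t,
    landmark k, and coordinate c (0 for x, 1 for y).
    """
    totals = [0.0] * num_landmarks
    for row in weights:
        for d, weight in enumerate(row):
            landmark = (d // 2) % num_landmarks
            totals[landmark] += abs(weight)
    # Each landmark owns 2 coordinates per frame for every hidden unit.
    per_landmark = 2 * num_frames * len(weights)
    means = [total / per_landmark for total in totals]
    order = sorted(range(num_landmarks), key=lambda k: means[k], reverse=True)
    return order[:top_k], means


# Hypothetical weights: 1 hidden unit, 3 landmarks, 2 frames (12 inputs).
toy_row = [0.0] * 12
toy_row[0] = 0.5    # landmark 0, frame 0, x
toy_row[2] = -2.0   # landmark 1, frame 0, x
toy_row[9] = 1.0    # landmark 1, frame 1, y
toy_top, toy_means = rank_landmarks([toy_row], num_landmarks=3,
                                    num_frames=2, top_k=2)
```

Here landmark 1 receives the largest mean absolute weight and is ranked first, followed by landmark 0.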






















6. Experiments 

To assess the performance of our method, we used three databases: CK+, Oulu-CASIA, and MMI. The number of image sequences in each database is listed for each emotion in Table 1. In the experiments, we also compared against algorithms that were not mentioned in Section 2, such as manifold based sparse representation (MSR) [14], AdaLBP [21], Atlases [5], and common and specific active patches (CSPL) [23].

6.1. Implementation 

First, we normalized the original image sequences to twelve frames using the code downloaded from [24]. In the case of MMI, we normalized the sequences to twenty-four frames, because the original image sequences contained more frames than those in the other databases, and then selected the first twelve frames. In order to detect a face in the normalized image sequences, we used the OpenCV Haar-like detector [2], and the detected faces were used as the initial position for IntraFace [20]. For DTAN, we selected three face images (the 1st, 7th, and 12th frames). The architecture of our deep network was implemented with CudaConvnet2 [10].

6.2. CK+ 

Figure 8. Comparison of Accuracy in the CK+ according to each emotion among three networks.

Description of the Database. CK+ is a representative database for facial expression recognition. This database is composed of 327 image sequences with seven emotion labels: anger, contempt, disgust, fear, happiness, sadness, and surprise. There are 118 subjects, and these subjects are divided into ten groups in ascending order of ID. Nine subsets were used for training our networks, and the remaining subset was used for validation. This process is the same as the 10-fold validation protocol in [] ]. In this database, each sequence starts with a neutral emotion and ends with the peak of the emotion.
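A subject-independent 10-fold split of this kind can be sketched as below. The text only states that subjects are divided into ten groups in ascending order of ID, so the round-robin assignment used here is an assumption (contiguous blocks of sorted IDs would be an equally plausible reading), and the function names are hypothetical.

```python
def subject_folds(subject_ids, num_folds=10):
    """Assign sorted subject IDs to folds (assumed: round-robin assignment)."""
    ordered = sorted(set(subject_ids))
    folds = [[] for _ in range(num_folds)]
    for index, subject in enumerate(ordered):
        folds[index % num_folds].append(subject)
    return folds


def split_by_fold(sequences, folds, held_out_index):
    """Split (subject_id, sequence) pairs into training and validation sets."""
    held_out = set(folds[held_out_index])
    train = [item for item in sequences if item[0] not in held_out]
    validation = [item for item in sequences if item[0] in held_out]
    return train, validation


# Hypothetical example: 118 subjects with one placeholder sequence each.
toy_folds = subject_folds(range(1, 119))
toy_sequences = [(subject, "sequence") for subject in range(1, 119)]
toy_train, toy_validation = split_by_fold(toy_sequences, toy_folds, 0)
```

Because the split is made over subjects rather than sequences, no subject contributes sequences to both the training and the validation set of a fold.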

Details of the Architecture. The architecture of DTGN is D1176-FC100-FC600-S7. D1176 is a 1176 dimensional input vector, and FC100 refers to a fully connected layer with 100 nodes. S7 is the softmax layer with seven outputs. The dropout rates were set to 0.1, 0.5, and 0.5 for the input and the two hidden layers, respectively.

Our DTAN model for CK+ is I64-C(5,64)-L5-P2-C(5,64)-L3-P2-FC500-FC500-S7, where I64 means 64 × 64 input image sequences, and C(5,64) is a convolutional layer with 64 filters of size 5 × 5. L5 is a local contrast normalization layer with a window size of 5 × 5. P2 means a 2 × 2 max pooling layer. The stride of each layer was 1, with the exception of the pooling layers, whose stride was set to 2. The α mentioned in Section 4.5 was set to 0.5. The dropout rates for DTAN were set to 0.1, 0.5, and 0.5 for the input and the two fully connected layers, respectively.
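As a quick sanity check of these architecture strings, the sketch below counts the parameters they imply. It rests on assumptions that are not stated explicitly in the text: that D1176 arises from 49 landmarks × 2 coordinates × 12 frames, that every fully connected layer has one bias per output node, and that the first DTAN convolution uses one 5 × 5 filter per input frame (three frames) and one bias per feature map. The helper name is hypothetical.

```python
def fc_parameter_count(layer_sizes):
    """Count weights and biases of a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


NUM_LANDMARKS = 49   # assumption: inferred from 1176 = 2 * 49 * 12
NUM_FRAMES = 12      # sequences are normalized to twelve frames
input_dim = 2 * NUM_LANDMARKS * NUM_FRAMES

# D1176-FC100-FC600-S7
dtgn_layers = [input_dim, 100, 600, 7]
dtgn_params = fc_parameter_count(dtgn_layers)

# First DTAN layer C(5,64) on three input frames without temporal sharing:
# 64 feature maps, each with three 5x5 filters and a single bias.
dtan_first_conv_params = 64 * 3 * 5 * 5 + 64
```

Under these assumptions the DTGN has 182,507 parameters and the first DTAN convolution has 4,864, which illustrates how small both networks are compared with typical deep models.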


Table 1. The Number of Image Sequences for Each Emotion: anger (An), contempt (Co), disgust (Di), fear (Fe), happiness (Ha), sadness (Sa), and surprise (Su).

      An     Co     Di    Fe    Ha    Sa     Su
An    100    0      0     0     0     0      0
Co    0      94.44  0     0     0     5.56   0
Di    0      0      100   0     0     0      0
Fe    0      0      0     84    8     0      8
Ha    0      0      0     0     100   0      0
Sa    10.71  0      0     3.57  0     85.71  0
Su    0      1.2    0     0     0     0      98.8

Table 3. Confusion Matrix for the CK+ Database. The labels in the leftmost column and on the top represent the ground truth and prediction results, respectively.


Results. The total accuracy of 10-fold cross validation is shown in Table 2. The individual accuracies of DTAN (91.44%) and DTGN (92.35%) are lower than that of the best existing algorithm, STM-ExpLet (94.19%), but the integrated network (96.94%) outperforms the other state-of-the-art algorithms.

The two networks were complementary, as shown in Figure 8. The DTAN performed well on contempt, whereas it had lower accuracy on fear. On the other hand, the geometry-based model was strong on fear.

Table 3 shows the confusion matrix for CK+. Our algorithm performed well in recognizing anger, disgust, happiness, and surprise (at least 98.8% each). For the other emotions (contempt, fear, and sadness), the accuracy was between 84% and 94.44%.



























Method         Accuracy
HOG 3D         91.44
MSR            91.4
TMS            91.89
Cov3D          92.3
STM-ExpLet     94.19
3DCNN-DAP      92.4
DTAN           91.44
DTGN           92.35
DTAGN          96.94

Table 2. Overall Accuracy in the CK+ Database. The red and blue colors represent the first and second most accurate, respectively.



Method         Accuracy
3D SIFT        55.83
LBP-TOP        68.13
HOG 3D         70.63
AdaLBP         73.54
Atlases        75.52
STM-ExpLet     74.59
DTAN           74.38
DTGN           74.17
DTAGN          80.62

Table 4. Overall Accuracy in the Oulu-CASIA Database. The red and blue colors represent the first and second most accurate, respectively.


6.3. Oulu-CASIA 

Description of the Database. For further experiments, we used Oulu-CASIA, which includes 480 image sequences taken under normal illumination conditions. Each image sequence has one of six emotion labels: anger, disgust, fear, happiness, sadness, or surprise. There are 80 subjects, and 10-fold validation was performed in the same way as for CK+. Similar to the CK+ database, each sequence begins with a neutral facial expression and ends with the facial expression of the emotion.

Details of the Architecture. For Oulu-CASIA, the architecture of DTGN was D1176-FC100-FC600-S6. The DTAN model was the same as the DTAN model for CK+ except for the number of nodes in the top layer, because there are six labels in Oulu-CASIA. The value of α was set to 0.5. The dropout rates for DTAN and DTGN were the same as those for CK+.

Results. The accuracy of our algorithm was superior to that of the other state-of-the-art algorithms, as shown in Table 4. The best performance among the existing methods was 75.52%, which was achieved by Atlases, and this record had not been broken for three years. Our method improves on it by about 5 percentage points, reaching 80.62%.

In Figure 9, the performance of the two networks is compared. Similar to the case of CK+, we observed that the two networks were complementary to each other. In particular, the performance of the DTGN on disgust was lower than that of the DTAN, but the combined model produced good results.

Table 5 shows the confusion matrix for our algorithm. The performance for happiness, sadness, and surprise was good, but the performance for anger, disgust, and fear was relatively poor. In particular, anger and disgust were confused with each other by our algorithm.

6.4. MMI 

Description of the Database. MMI consists of 205 image sequences with frontal faces and includes only 30 subjects. Similar to the Oulu-CASIA database, there are six kinds of emotion labels. This database was also divided into ten groups for person-independent 10-fold cross validation.

This database differs from the other databases: each sequence begins with a neutral facial expression, shows the facial expression of the emotion in the middle of the sequence, and ends with a neutral facial expression again. The peak frame was not provided as prior information.

Figure 9. Comparison of Accuracy in the Oulu-CASIA according to each emotion among three networks. (Axes: accuracy for each emotion, An, Di, Fe, Ha, Sa, Su, and in total.)

Details of the Architecture. We used the DTGN model 
of D1176-FC100-FC70-FC6 for the MMI database. The 
number of subjects and image sequences was very small, so 
we decreased the number of nodes significantly. 

Our DTAN model was designed as I64-C(7,64)-L5-P2-C(5,32)-L3-P2-C(3,32)-L3-P2-FC300-FC300-6. Unlike the other two databases, this database has many pose changes. There is also a variety of environmental conditions, such as lighting changes. Consequently, we normalized the input faces using the eye coordinates, and local contrast normalization was applied to the input image sequences. The dropout rates for DTAN and DTGN were the same as those for CK+.

Table 5. Confusion Matrix for the Oulu-CASIA Database. The labels in the leftmost column and on the top represent the ground truth and prediction results, respectively.

Method         Accuracy
HOG 3D         60.89
3D SIFT        64.39
ITBN           59.7
CSPL           (73.53)
STM-ExpLet     75.12
3DCNN-DAP      63.4
DTAN           58.05
DTGN           56.1
DTAGN          66.33

Table 6. Overall Accuracy in the MMI Database. The red and blue colors represent the first and second most accurate, respectively. The CSPL used additional ground truth information, so it was excluded from the ranking.

Figure 10. Comparison of Accuracy in the MMI according to each emotion among three networks. (Axes: accuracy for each emotion, An, Di, Fe, Ha, Sa, Su, and in total.)

The value of α was different from that in the other experiments. We set it to 0.42, which was determined manually.

Results. As shown in Table 6, our algorithm ranked second. The CSPL algorithm was excluded from the ranking, because it used the peak frame number, which is an additional type of ground truth information.

We compared the two networks and the combined model in Figure 10. The two networks were complementary to each other for most of the emotions. However, for fear, the accuracy of our algorithm was degraded. This is also shown in the confusion matrix in Table 7. We observed that the accuracy for fear was much lower than for the other emotions. In particular, most of the fear sequences were confused with surprise. To determine why this phenomenon appears, we checked all the failure cases, as shown in Figure 11.

Table 7. Confusion Matrix for the MMI Database. The labels in the leftmost column and on the top represent the ground truth and prediction results, respectively.

Figure 11. All Failure Cases with Fear in the MMI database. Our deep network misclassified fear as surprise (green box), anger (red box), disgust (yellow box), sadness (orange box), and happiness (blue box).

The results indicated that fear was expressed through a variety of facial expressions. Many cases were similar to surprise or sadness. To learn these various expressions, more varied training data are required. However, we had only 27 subjects for training (three subjects were used for validation). Unfortunately, the performance of deep learning techniques depends highly on the quality of training data, so our accuracy for fear was not good enough.

7. Conclusion 

We presented two deep network models that collaborate with each other. The first network was DTAN, which was based on multiple frames, while the second network was DTGN, which extracted useful temporal geometric features from raw facial landmark points. We showed that the filters learned by the DTAN in the first layer are able to capture the differences between the input frames. Furthermore, the important landmark points extracted by DTGN were also shown. As a result, we achieved the best recognition rates using the integrated deep network on the CK+ and Oulu-CASIA databases. However, on the MMI database, our algorithm had lower accuracy because there were only 30 subjects, which is too small a sample for deep learning models to make correct predictions.